I wanted to build on my previous hypothesis, that distance can be used to predict the fare of a route, by adding the average number of passengers who fly the route per day. I expect more popular routes to be cheaper than those with low traffic. I also felt that the best third explanatory variable to include in this analysis was the interaction between distance and passengers. The other options in the provided dataset just didn't seem to mesh as well with the two I had already included.
\[ \underbrace{Y_i}_\text{fare} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{base fare}}} + \overbrace{\beta_1}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{distance} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{passen} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{dist:passen} + \epsilon_i \]
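Below is the multiple regression result using distance, passengers, and the distance-by-passengers interaction.
library(magrittr)  # provides %>%
library(pander)    # provides pander()
# Fit fare on distance, passengers, and their interaction
lm.mult <- lm(fare ~ dist + passen + dist:passen, data = IO_airfare)
summary(lm.mult) %>%
  pander(caption = "HW 3 Simple Multiple regression results")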
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 118.7 | 2.046 | 58 | 0 |
| dist | 0.06534 | 0.001711 | 38.2 | 1.803e-277 |
| passen | -0.0212 | 0.001724 | -12.3 | 3.192e-34 |
| dist:passen | 1.537e-05 | 1.51e-06 | 10.18 | 4.511e-24 |

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 57.62 | 0.4083 | 0.4079 |
Below is the multiple regression result using distance and passengers, but without the distance-by-passengers interaction.
# Additive model: distance and passengers only
lm.mult2 <- lm(fare ~ dist + passen, data = IO_airfare)
summary(lm.mult2) %>%
  pander(caption = "HW 3 Simple Multiple regression w/o Interaction")
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 108.8 | 1.823 | 59.69 | 0 |
| dist | 0.07541 | 0.001411 | 53.45 | 0 |
| passen | -0.007297 | 0.001063 | -6.864 | 7.574e-12 |

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 58.26 | 0.395 | 0.3947 |
Here are the results of the HW 2 regression, which predicts fare using distance alone.
# Simple regression: distance only
lm.sim <- lm(fare ~ dist, data = IO_airfare)
summary(lm.sim) %>%
  pander(caption = "HW 2 simple regression results")
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 103.3 | 1.643 | 62.87 | 0 |
| dist | 0.07631 | 0.001412 | 54.05 | 0 |

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4595 | 58.55 | 0.3888 | 0.3886 |
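The three models above are nested, so the gain in \(R^2\) from each added term can be turned into a partial F statistic. The sketch below does this using only the rounded numbers reported in the tables. The helper name `partial_f` is hypothetical, and the results are approximate because the \(R^2\) values are rounded.

```r
# Partial F statistic for adding q terms to a nested linear model,
# computed from the two R^2 values (hypothetical helper, not from the homework).
partial_f <- function(r2_full, r2_reduced, n, p_full, q = 1) {
  # p_full = number of predictors in the full model, excluding the intercept
  df_resid <- n - p_full - 1
  ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
}

n <- 4595  # observations, from the summary tables

# Interaction model vs. additive model (adds dist:passen)
f_interaction <- partial_f(0.4083, 0.3950, n, p_full = 3)

# Additive model vs. distance-only model (adds passen)
f_passen <- partial_f(0.3950, 0.3888, n, p_full = 2)

# With one added term, F should match the squared t value of that term,
# up to rounding in the reported R^2 values.
round(c(f_interaction = f_interaction, t_squared = 10.18^2), 1)
round(c(f_passen = f_passen, t_squared = (-6.864)^2), 1)
```

With the fitted objects available, `anova(lm.sim, lm.mult2, lm.mult)` should give the same comparison without the rounding error.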
Here is the original regression equation with the estimated coefficients filled in.
\[ \underbrace{Y_i}_\text{fare} \underbrace{=}_{\sim} \overbrace{118.7}^{\stackrel{\text{y-int}}{\text{base fare}}} + \overbrace{0.06534}^{\stackrel{\text{slope}}{\text{baseline}}} \underbrace{X_{1i}}_\text{distance} + \overbrace{-0.0212}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{passen} + \overbrace{1.537 \times 10^{-5}}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{dist:passen} + \epsilon_i \]
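One way to read the fitted equation above is to group its terms by distance. This is only algebra on the rounded coefficients, but it shows that both the intercept and the per-mile rate shift with the passenger count:

\[ \widehat{\text{fare}} = \left(118.7 - 0.0212\,X_{2}\right) + \left(0.06534 + 1.537 \times 10^{-5}\,X_{2}\right) X_{1} \]

Grouping by passengers instead gives the effect of one additional passenger at a given distance. Assuming distance is measured in miles, that effect changes sign at roughly 1,379 miles, so the model predicts that extra passengers lower the fare on shorter routes and raise it on longer ones:

\[ \frac{\partial\,\widehat{\text{fare}}}{\partial X_{2}} = -0.0212 + 1.537 \times 10^{-5}\,X_{1} = 0 \iff X_{1} = \frac{0.0212}{1.537 \times 10^{-5}} \approx 1379 \]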
#b <- coef(lm.mult)
## Hint: library(car) has a scatterplot 3d function which is simple to use
# but the code should only be run in your console, not knit.
#library(car)
#scatter3d(fare ~ dist + passen, data=IO_airfare)
## To embed the 3d-scatterplot inside of your html document is harder.
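# Build a prediction grid for the fitted surface (uses lm.mult from above)
library(reshape2)  # provides acast()
# Grid resolution: number of points per axis. A fixed step of 0.5 can produce
# millions of grid cells when dist and passen each span thousands of units.
graph_reso <- 50
# Set up the axes
axis_x <- seq(min(IO_airfare$dist), max(IO_airfare$dist), length.out = graph_reso)
axis_y <- seq(min(IO_airfare$passen), max(IO_airfare$passen), length.out = graph_reso)
# Sample points on the grid and predict the fare at each one
lmnew <- expand.grid(dist = axis_x, passen = axis_y, KEEP.OUT.ATTRS = FALSE)
lmnew$Z <- predict(lm.mult, newdata = lmnew)
# Reshape to a matrix: rows are passen (y), columns are dist (x)
lmnew <- acast(lmnew, passen ~ dist, value.var = "Z")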
#Create scatterplot
library(plotly)  # provides plot_ly()
plot_ly(IO_airfare,
        x = ~dist,
        y = ~passen,
        z = ~fare,
        text = rownames(IO_airfare),
        type = "scatter3d",
        mode = "markers",
        color = ~fare)
#add_trace(z = lmnew,
# x = axis_x,
# y = axis_y,
# type = "surface")
Based on the multiple regression, the base cost of a ticket is $118.70. Because the model includes an interaction, the other two coefficients are conditional: each additional mile raises the fare by about $0.065 when the passenger count is zero, and each additional passenger lowers the fare by about $0.021 when the distance is zero. The interaction coefficient looks like ~0 (1.537e-05), but it multiplies distance times passengers, so its effect is not negligible on long, busy routes. The p-values for all of these terms are incredibly close to 0. It is worth noting that the p-value for distance is far smaller than those for passengers or the interaction, and distance carries most of the explanatory power: on its own it gives an \(R^2\) of 0.3888, and adding passengers and the interaction only raises that to 0.4083.
These relationships are best seen in the 3D plot. It is quickly apparent that distance is a significant predictor, because the points cluster along the distance axis.
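To make the coefficient interpretation above concrete, here is a small sketch that plugs the rounded coefficients from the interaction model table into the regression equation. The helper names `fare_hat` and `dist_slope` are hypothetical, and the example route (1,000 miles, 500 passengers per day) is illustrative rather than a row from the dataset.

```r
# Rounded coefficients copied from the interaction model table
b <- c(intercept = 118.7, dist = 0.06534, passen = -0.0212, dist_passen = 1.537e-05)

# Predicted fare for a route (hypothetical helper)
fare_hat <- function(dist, passen) {
  unname(b["intercept"] + b["dist"] * dist + b["passen"] * passen +
           b["dist_passen"] * dist * passen)
}

# Change in predicted fare per extra mile at a given passenger count
dist_slope <- function(passen) {
  unname(b["dist"] + b["dist_passen"] * passen)
}

fare_hat(dist = 1000, passen = 500)   # illustrative route, not a row from the data
dist_slope(passen = c(0, 500, 2000))  # per-mile rate grows with passenger count
```

With the fitted model in memory, `predict(lm.mult, newdata = data.frame(dist = 1000, passen = 500))` should give nearly the same fare, differing only by coefficient rounding.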
# Diagnostic plots: residuals vs. fitted, normal Q-Q, and residuals in row order
par(mfrow = c(1, 3))
plot(lm.mult, which = 1:2)
plot(lm.mult$residuals)
How can it be that the \(R^2\) is smaller when the variable age is added to the equation?